Optimal Markov Approximations and Generalized Embeddings 



Detlef Holstein and Holger Kantz
Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str., 01187 Dresden, Germany

(Dated: August 11, 2008)






Based on information theory, we present a method to determine an optimal Markov approximation
for modelling and prediction from time series data. The method finds a balance between minimal
modelling errors, achieved by taking as much memory as possible into account, and minimal
statistical errors, achieved by working in embedding spaces of rather small dimension. A key
ingredient is an estimate of the statistical error of entropy estimates. The method is
illustrated with several examples, and the consequences for prediction are evaluated by means
of the root mean squared prediction error for point prediction.

I. INTRODUCTION

Given is a univariate time series $\{x_i : i = 1, \ldots, N\}$, obtained from the time evolution
of some deterministic or stochastic dynamical system by applying a scalar measurement function
to the state vectors of this system. We will assume that the measurements are equidistant in
time. A by now standard approach to modelling and prediction based on univariate time series
data starts from the construction of a multidimensional state space. Commonly used is the time
delay embedding space. In the case of deterministic dynamics, the Takens theorem [1] states
that if the embedding dimension $m$ satisfies $m > 2 D_f$, where $D_f$ is the fractal dimension
of the attractor, then $m$-dimensional delay vectors
$(x_{n-k(m-1)}, x_{n-k(m-2)}, \ldots, x_n)$ with delay $k$ can be uniquely mapped onto the
non-observed state vectors. Hence the process in the $m$-dimensional delay embedding space is
deterministic in the sense of the existence and uniqueness of the solution of the initial value
problem. In special cases, smaller values of $m$ might be sufficient for reconstruction of the
underlying dynamics.
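As an illustration of the delay vectors described above, the following minimal Python sketch builds them from a scalar series. The function name `delay_vectors` and its argument names are hypothetical and not part of the original text; NumPy is assumed to be available.

```python
import numpy as np

def delay_vectors(x, m, k=1):
    """Return the delay vectors (x[n-k(m-1)], ..., x[n-k], x[n]) as rows.

    x: scalar time series, m: embedding dimension, k: delay (illustrative names).
    """
    x = np.asarray(x, dtype=float)
    span = k * (m - 1)  # number of time steps covered by one delay vector
    if m < 1 or k < 1 or span >= len(x):
        raise ValueError("need m >= 1, k >= 1 and k*(m-1) < len(x)")
    n_vectors = len(x) - span
    # column j holds the series shifted by j*k steps; the last column is x[n]
    return np.column_stack([x[j * k : j * k + n_vectors] for j in range(m)])
```

For example, `delay_vectors(np.arange(6), m=3, k=2)` yields two rows, the last of which is `(x[1], x[3], x[5])`.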

As has been argued recently [2], the modelling and prediction of time series data from
stochastic processes can also profit from the concept of state space reconstruction: In an
ideal situation, there exists a time delay embedding space in which the stochastic dynamics is
Markovian of (possibly higher) order $m$, i.e., in which the conditional probability density
function (pdf) $p(x_n | x_{n-m}, \ldots, x_{n-1})$ to find a given future value cannot be made
narrower by including more past values into the condition. In the framework of time series
analysis, the conditional pdf has to be estimated from the data. This can be done either by
estimating conditional probabilities through binning and counting [3], or by kernel estimators.
Two consequences arise: These estimates are subject to statistical errors, and a length scale
$\epsilon$ is introduced, i.e., the estimated conditional probabilities do not vary as a
function of the condition $(x_{n-m}, \ldots, x_{n-1})$ on length scales smaller than
$\epsilon$. The statistical error is a function not only of the dataset size $N$ but also of
the spatial resolution $\epsilon$. When models have to be fitted to observed data, model
parameters are to be determined. The estimated conditional probabilities can be interpreted as
the model parameters of a Markov model. In data analysis tasks, however, the Markov order $m$
is usually not known a priori and has to be obtained from the data.

In both the deterministic and the stochastic cases, finding the suitable embedding dimension
is one of the practical issues. In the stochastic case the embedding dimension can be
associated with the number of time steps of nonvanishing condition, which in the absence of
intermediate time steps of vanishing condition reduces to the Markov order $m$. Whereas for the
deterministic case mathematically rigorous results [5] as well as numerically efficient and
reliable algorithms [6] exist, for the stochastic case only statistically demanding tests of
the Chapman-Kolmogorov equation are currently in use [7]. In both cases there is the practical
problem that, from a theoretical point of view, the embedding dimension for the process could
be very large. If the amount of data is insufficient for the statistical robustness either of
the algorithms that determine the embedding dimension empirically or of estimates of, e.g.,
model parameters in a corresponding space, then a high-dimensional model is practically
irrelevant, even if theoretically justified. Hence, in many situations an effective model with
a different embedding dimension might be superior to the model suggested by the structure of
the underlying dynamics. We will illustrate this statement later and demonstrate its relevance.

When identifying the optimal embeddings, i.e., when looking for the optimal Markov
approximations, we have to take into account two types of errors. The first one is a modelling
error, which we make if we ignore components in the past of the time series which are relevant
for its future. In the deterministic setting, this would mean that we use an embedding
dimension which is too small. In the stochastic setting, it means that the Markovianity, in
general of higher order, or the cardinality of time steps of nonvanishing condition is not
fully captured by the chosen embedding space. The second error is a statistical error.
Regardless of which quantity is estimated from a finite dataset, its value is always subject to
a statistical error. In the context of prediction and modelling, the corresponding samples are
usually obtained from neighbourhoods of delay vectors. Sample sizes are small, and statistical
errors correspondingly large, if we work in embedding spaces whose dimension is too large
compared to the amount of available data and compared to the diameter of the neighbourhoods,
i.e., the locality of the estimate.

Hence, what we propose here, with the intention of rather general applicability, is a concept
to identify optimal resolution-dependent Markov approximations in which the combined effect of
modelling errors and statistical errors is minimal.

Practically, we will relate the modelling error to the discrepancy of conditional entropies
from entropies with sufficient conditioning. The statistical error of a model will be related
to the statistical error in entropy estimation. Therefore, we will explicitly carry out error
estimates for entropy estimations. In searching for models which capture the memory of a
process but involve an embedding dimension that is as small as possible, we will also consider
non-standard, so-called perforated embeddings, namely those where the temporal spacings
between successive elements of a delay vector are not identical for all pairs of adjacent
components. Such embeddings were also discussed in [8] and [9]. This paper makes a new
suggestion for how to find optimal ones.

In the next section the basic quantities of information theory that enter the criteria for
optimal Markov approximations are introduced. In Sec. III we recall a widely used procedure for
the estimation of entropies and discuss the statistical errors in the numerical estimation of
the correlation entropy. In Sec. IV a novel method for the selection of usual Markov
approximations is presented, but it is immediately pointed out that the framework has to be
generalized in order to be suitable for arbitrary dynamics. A unified notation for entropies in
the time series analytical framework, suitable for the treatment of variable future lead times,
jointly conditioned joint entropies, noncausal conditionings, downsampling and arbitrary
omissions in conditionings, is introduced in Sec. V and remedies the aforementioned problems.
The notion of perforatedness is introduced. In Sec. VI we present the method to identify
optimal generalized Markov approximations as a function of the data accuracy $\epsilon$ for a
given time series of fixed length $N$. Subsequently, the success of the introduced criterion
for the determination of optimal perforated Markov approximations is illustrated for several
model processes with memory in Sec. VII. We show that the theoretically optimal embedding of
the process, derived from the dynamical law behind the generated data sets, is indeed not
necessarily the optimal state space representation of a finite time series for all resolutions.
Furthermore, the dependence on the length of the underlying dataset is discussed in detail.
Some consequences for prediction are outlined in Sec. VIII using the example of the generalized
Henon map. In Sec. IX the results of this paper are summarized.



II. RELEVANT QUANTITIES OF 
INFORMATION THEORY 



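The resolution($\epsilon$)-dependent joint Renyi block entropy of order $q$ is given by

$$H_m^{(q)}(\epsilon) = \frac{1}{1-q} \ln \sum_{i_1, \ldots, i_m} p_{i_1, \ldots, i_m}^q(\epsilon) . \qquad (1)$$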



It estimates the joint uncertainty of the random variables corresponding to $m$ successive
time steps of a time series. If the random variables are dependent, a conditional probability
distribution is narrower than the corresponding unconditioned probability distribution.
Further conditioning into the past in general further decreases the width of the distribution
and the uncertainty of the outcome of the random experiment. This behaviour can also be
quantified with conditional entropies defined by

$$H_{1|m}(\epsilon) := H_{m+1}(\epsilon) - H_m(\epsilon) \qquad (2)$$



as the difference of joint block entropies with different block lengths. In this formula and
the subsequent ones the Renyi order $q$ is notationally omitted. Conditional entropies are
interpreted as the remaining uncertainty after having used the information from the chosen
conditioning. In the case of maximal, i.e., infinite conditioning, the irreducible uncertainty
is obtained as

$$H_{1|\infty}(\epsilon) = \lim_{m \to \infty} H_{1|m}(\epsilon) . \qquad (3)$$

The redundancy is defined by

$$R_m(\epsilon) := H_1(\epsilon) - H_{1|m}(\epsilon) , \qquad (4)$$



and hence is interpreted as the uncertainty reduction of the immediate future random variable
from conditioning on the adjacent $m$ past time steps. The quantity

$$Q_m(\epsilon) := H_{1|m}(\epsilon) - H_{1|\infty}(\epsilon) \qquad (5)$$

is called ignored memory. It is the uncertainty reduction of the immediate future random
variable that is accessible in principle but renounced. Combining Eq. (4) and Eq. (5) shows
that the total uncertainty of a single random variable can be decomposed according to

$$H_1(\epsilon) = H_{1|\infty}(\epsilon) + R_m(\epsilon) + Q_m(\epsilon) . \qquad (6)$$


III. ESTIMATION OF ENTROPIC QUANTITIES
AND STATISTICAL ERRORS OF 
CORRELATION ENTROPIES 

Our method for finding optimal Markov approximations will find a balance between maximal
uncertainty reduction of future values of the time series and minimized statistical errors. To
this end, we discuss here the estimation of entropies from finite time series and in
particular the statistical errors involved in this estimation. Because it is the statistically
most robust and algorithmically most convenient quantity, we concentrate on the order-2 Renyi
entropy $H^{(q=2)} = -\ln \sum_i p_i^2$, which is estimated from the Grassberger-Procaccia
correlation sum [6]

$$C_m^{(2)}(N, \epsilon) = \frac{1}{N(N-1)} \sum_{i} \sum_{k \neq i} \Theta(\epsilon - \|\mathbf{x}_i - \mathbf{x}_k\|) \qquad (7)$$

by

$$H_m^{(2)}(N, \epsilon) = -\ln C_m^{(2)}(N, \epsilon) . \qquad (8)$$

We use the maximum norm in the argument of the Heaviside function $\Theta$. The conditional
entropy of Eq. (2), our building block for ignored memory and redundancy, is obtained from

$$H_{1|m}^{(2)}(N, \epsilon) = \ln C_m^{(2)}(N, \epsilon) - \ln C_{m+1}^{(2)}(N, \epsilon) . \qquad (9)$$
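The estimators above can be sketched in a few lines of Python. This is a brute-force illustration, not the authors' implementation: the function names are hypothetical, the correlation sum is normalized by the number of delay vectors actually available for each $m$ (an assumption), and the quadratic memory cost restricts it to short series.

```python
import numpy as np

def neighbor_counts(vectors, eps):
    """w_i: number of other vectors closer than eps in the maximum norm."""
    v = np.asarray(vectors, dtype=float)
    # all pairwise maximum-norm distances; O(n^2) memory, small samples only
    dist = np.abs(v[:, None, :] - v[None, :, :]).max(axis=2)
    return (dist < eps).sum(axis=1) - 1  # remove the self-match k = i

def correlation_sum(vectors, eps):
    """Fraction of ordered pairs (i, k), k != i, with distance below eps."""
    n = len(vectors)
    return neighbor_counts(vectors, eps).sum() / (n * (n - 1))

def block_entropy(x, m, eps):
    """Order-2 entropy estimate H_m = -ln C_m from unit-delay vectors."""
    x = np.asarray(x, dtype=float)
    vecs = np.column_stack([x[j : len(x) - m + 1 + j] for j in range(m)])
    return -np.log(correlation_sum(vecs, eps))

def conditional_entropy(x, m, eps):
    """H_{1|m} = H_{m+1} - H_m, i.e. ln C_m - ln C_{m+1}."""
    return block_entropy(x, m + 1, eps) - block_entropy(x, m, eps)
```

For a strictly periodic series such as 0, 1, 0, 1, ... the estimated conditional entropy vanishes, as expected for a fully predictable signal. For long series a neighbor-search structure would replace the dense distance matrix.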



As was shown by Grassberger [10], the correlation sum does not suffer from systematic finite
sample effects, i.e., it is an unbiased estimator of the correlation integral. Consequently,
the mean value of estimated quantities such as the correlation entropy or the correlation
dimension on data sets of fixed size $N$ will be correct for arbitrarily small $\epsilon$, as
long as the combination $(N, \epsilon)$ is such that the correlation sum is non-zero. However,
each individual result is subject to statistical errors. In the following we estimate the
magnitude of these errors.

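To begin with, we introduce the random variable $W_m(\epsilon, \mathbf{x}_i)$ for the number of
vectors $\mathbf{x}_k$ similar to $\mathbf{x}_i$, i.e., with distance smaller than $\epsilon$
according to the chosen norm. It is the random variable for the cardinality of the set
$\{\mathbf{x}_k \in \mathcal{U}(\epsilon, \mathbf{x}_i) : k \in \{1, \ldots, N-m\}\}$, where the
$\epsilon$-neighborhood of the vector $\mathbf{x}$ is defined by

$$\mathcal{U}(\epsilon, \mathbf{x}) := \{\mathbf{z} : \|\mathbf{z} - \mathbf{x}\| < \epsilon\} . \qquad (10)$$

$W_m(\epsilon, \mathbf{x}_i)$ is distributed according to a binomial distribution. For a given
dataset, the realization of $W_m(\epsilon, \mathbf{x}_i)$ is given by

$$w_m(N, \epsilon, \mathbf{x}_i) := \sum_{k \neq i} \Theta(\epsilon - \|\mathbf{x}_i - \mathbf{x}_k\|) . \qquad (11)$$

With this expression the correlation sum (Eq. (7)) can be written as

$$C_m^{(2)}(N, \epsilon) = \frac{1}{N(N-1)} \sum_i w_m(N, \epsilon, \mathbf{x}_i) . \qquad (12)$$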



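Except for very large $\epsilon$ or extremely small $N$, the distribution of the random
variable $W_m(\epsilon, \mathbf{x}_i)$ can be excellently approximated by a Poisson
distribution. This leads to the property

$$\mathrm{Var}(W_m(\epsilon, \mathbf{x}_i)) = \mathrm{E}(W_m(\epsilon, \mathbf{x}_i)) , \qquad (13)$$

and therefore

$$\Delta w_m(N, \epsilon, \mathbf{x}_i) \approx \sqrt{w_m(N, \epsilon, \mathbf{x}_i)} . \qquad (14)$$

Assuming mutual independence of the $W_m(\epsilon, \mathbf{x}_i)$ and using the standard rules
for error propagation (additivity of the variances) as well as the approximate relation (14),
the statistical error of the correlation sum is estimated by

$$\Delta C_m^{(2)}(N, \epsilon) = \frac{1}{N(N-1)} \sqrt{\sum_i (\Delta w_m(N, \epsilon, \mathbf{x}_i))^2} \approx \frac{1}{N(N-1)} \sqrt{\sum_i w_m(N, \epsilon, \mathbf{x}_i)} . \qquad (15)$$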

Thus it can be computed from the non-normalized correlation sums, which are needed anyway to
estimate entropies. From Eqs. (8), (12) and (15) the statistical error of the Renyi entropy can
be calculated as

$$\Delta H_m^{(2)}(N, \epsilon) = \frac{\Delta C_m^{(2)}(N, \epsilon)}{C_m^{(2)}(N, \epsilon)} = \frac{1}{\sqrt{\sum_i w_m(N, \epsilon, \mathbf{x}_i)}} , \qquad (16)$$

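and the statistical error of the usual conditional entropy is obtained from

$$\Delta H_{1|m}^{(2)}(N, \epsilon) = \sqrt{[\Delta H_{m+1}^{(2)}(N, \epsilon)]^2 + [\Delta H_m^{(2)}(N, \epsilon)]^2} = \sqrt{\frac{1}{\sum_i w_{m+1}(N, \epsilon, \mathbf{x}_i)} + \frac{1}{\sum_i w_m(N, \epsilon, \mathbf{x}_i)}} . \qquad (17)$$

Further error propagation for the estimated redundancy,

$$\Delta R_m(N, \epsilon) = \sqrt{[\Delta H_1(N, \epsilon)]^2 + [\Delta H_{m+1}(N, \epsilon)]^2 + [\Delta H_m(N, \epsilon)]^2} , \qquad (18)$$

is possible in the same way. This quantity will be needed for the criterion given in Eq. (19).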
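The error estimates of Eqs. (16) to (18) depend only on the non-normalized neighbor counts $\sum_i w_m$. A minimal Python sketch follows; the function names are hypothetical, and returning an infinite error when no neighbors are found is an added convention, not something stated in the text.

```python
import math

def entropy_error(total_neighbors):
    """Eq. (16): Delta H_m = 1 / sqrt(sum_i w_m); infinite if no pair was found."""
    if total_neighbors <= 0:
        return math.inf  # assumed convention for an empty correlation sum
    return 1.0 / math.sqrt(total_neighbors)

def conditional_entropy_error(total_m, total_m1):
    """Eq. (17): quadratic sum of the errors of H_m and H_{m+1}."""
    return math.hypot(entropy_error(total_m), entropy_error(total_m1))

def redundancy_error(total_1, total_m, total_m1):
    """Eq. (18): quadratic sum of the errors of H_1, H_m and H_{m+1}."""
    return math.sqrt(
        entropy_error(total_1) ** 2
        + entropy_error(total_m) ** 2
        + entropy_error(total_m1) ** 2
    )
```

Here `total_1`, `total_m` and `total_m1` stand for the summed neighbor counts in embedding dimensions 1, $m$ and $m+1$.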

The assumptions entering the arguments for usual er- 
ror propagation are: 

1. Independence of the random variables 

2. Gaussian error statistics 

3. Errors are small so that nonlinear expressions can 
be approximated by first order Taylor expansions 
around the mean. 

The authors are aware of the fact that item 1 is violated: if
$\|\mathbf{x}_i - \mathbf{x}_{i'}\| < \epsilon$, then the phase space points have overlapping
neighborhoods and $W_m(\epsilon, \mathbf{x}_i)$ and $W_m(\epsilon, \mathbf{x}_{i'})$ are not
independent of each other. Correlations among the $W_m(\epsilon, \mathbf{x}_i)$ yield a smaller
effective sample size, such that Eq. (16) underestimates the true statistical error of
entropies. The violation of the assumption of item 1, however, becomes less relevant the
smaller $\epsilon$ is, since then the overlap of neighborhoods decreases. Item 2 is violated,
since the error statistics of our basic random variables $W_m(\epsilon, \mathbf{x}_i)$ is
explicitly non-Gaussian. This violation becomes stronger the smaller the values of
$w_m(N, \epsilon, \mathbf{x}_i)$ become, i.e., for small $\epsilon$. Nevertheless, in spite of
those arguments, usual error propagation is used as an approximation of the true errors of the
estimation of entropic quantities.

Except for Eq. (22), in the following the dependence on the length $N$ of the dataset will
only be shown for the statistical errors, since the expectation values of entropies and
derived quantities do not depend on $N$.



IV. A NOVEL CRITERION FOR USUAL 
MARKOV APPROXIMATIONS 

As already mentioned in the introduction, there are 
two kinds of errors involved in our strategy for the deter- 
mination of optimal Markov approximations: 

First, there is a modelling error. If a Markov approximation is carried out, typically
information about the future is truncated and is no longer available for uncertainty
reduction. This renounced uncertainty reduction can be quantified by the ignored memory $Q_m$
given in Eq. (5). The value of $Q_m$ should be small. It becomes smaller (or remains the same)
the more components in the past are taken into account, i.e., the higher the order $m$ of the
Markov approximation. Naively, one could be tempted to demand that the ignored memory in
optimal Markov approximations should vanish, but in the case of an infinite range of memory in
the past the resulting Markov order would be infinite, which is undesirable for practical
applications.

Second, a statistical error has to be discussed. There is an unavoidable statistical error in
the estimation of entropies, which propagates to a nonvanishing statistical error of the
uncertainty reduction achieved by conditioning. This statistical error, quantified by
$\Delta R_m$ given in Eq. (18), describes the irreproducibility of the uncertainty reduction.
This term should also be rather small in order to make the uncertainty reduction trustworthy.
$\Delta R_m$ increases with larger Markov order $m$, because fewer neighbors are found in the
estimation of the correlation sum under the more restrictive conditions. Demanding only the
minimization of the statistical error of the redundancy in a criterion for optimal Markov
approximations would thus lead to empty conditionings. It is intuitively clear that this
cannot in general be a reasonable solution either.

Since, in contrast to $Q_m$, the term $\Delta R_m$ is smaller the fewer past components are
taken into account, the reduction of both errors are complementary demands and one faces an
optimization problem. The aim is now to give a criterion such that for arbitrary dynamics a
resolution-dependent optimal Markov approximation can be found. The ad hoc choice for such a
criterion reads:

$$m_{\mathrm{opt}}(\epsilon) = \max\{m \in \mathbb{N} : \Delta R_m(N, \epsilon) < Q_m(\epsilon)\} . \qquad (19)$$

The reason for the criterion can be understood as follows: The maximal memory that can
sensibly be taken into account is restricted by the condition that the statistical error of
the redundancy, i.e., the statistical error of the uncertainty reduction, has to be smaller
than the ignored memory. Otherwise the ignored memory can no longer be resolved by enlarging
the order of the Markov approximation. Here we use that the statistical error of the
redundancy increases with the Markov order, whereas the ignored memory decreases with the
Markov order. Hence, starting from the smallest possible Markov order, $m$ is increased as
long as the statistical error of the redundancy remains smaller than the ignored memory.
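The scan described in this paragraph can be written down directly. The following Python sketch assumes that estimates of $Q_m$ and $\Delta R_m$ are already available for $m = 1, 2, \ldots$ at a fixed resolution; the function name, the indexing from $m = 1$ and the return value 0 for "no admissible order" are illustrative assumptions. In practice $H_{1|\infty}$, which enters $Q_m$, has to be approximated by the longest numerically accessible conditioning.

```python
def optimal_markov_order(ignored_memory, redundancy_error):
    """Increase m from 1 while Delta R_m < Q_m and return the last admissible m.

    ignored_memory[j] and redundancy_error[j] belong to Markov order m = j + 1.
    Returns 0 if the condition already fails for m = 1 (assumed convention).
    """
    m_opt = 0
    for m, (q, dr) in enumerate(zip(ignored_memory, redundancy_error), start=1):
        if dr < q:
            m_opt = m   # statistical error still below the ignored memory
        else:
            break       # the ignored memory is no longer resolvable
    return m_opt
```

If $\Delta R_m$ increases and $Q_m$ decreases monotonically with $m$, as assumed in the text, the value returned by this scan coincides with the maximum in Eq. (19).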

The whole reasoning is resolution-dependent. The Markov model deals with the conditional
probabilities

$$P_{i_n | i_{n-m}, \ldots, i_{n-1}}(\epsilon) = \int_{B_{i_{n-m}, \ldots, i_n}(\epsilon)} p(x_n | x_{n-m}, \ldots, x_{n-1}) \, dx_{n-m} \cdots dx_n \qquad (20)$$

that a state inside some $\epsilon$-subset of the state space is mapped onto some
$\epsilon$-interval corresponding to the future. $B_{i_{n-m}, \ldots, i_n}(\epsilon)$ is a
resolution-dependent $(m+1)$-dimensional box as an element of the partition of the underlying
embedding space. The result is a set of approximations to the true conditional probability
density which vary only on spatial scales larger than $\epsilon$. E.g., a model obtained for
relatively small $\epsilon$ has the potential to represent very fine structures in the state
space, but it suffers from poor statistics. Since for larger $\epsilon$ the statistics get
better but only coarser structures are resolved, the optimal Markov order (and later on the
optimal perforated model), as well as the prediction errors discussed in Sec. VIII, depend on
the spatial resolution $\epsilon$.

Finally, we want to make plausible that the criterion Eq. (19) for $m_{\mathrm{opt}}$ derived
from information theory really yields the optimal order of the Markov model describing the
underlying dynamics. The conditional probabilities of Eq. (20) corresponding to a Markov model
of order $m$ are estimated from a finite dataset. Hence they are subject to statistical
errors, which grow with $m$. Exactly the same statistical errors of conditional probabilities
would lead to statistical errors of the redundancy $\Delta R_m$ if we defined all information
theoretic quantities through Shannon entropies ($q = 1$), and they enter indirectly the
statistical errors of quantities based on the Renyi entropy of order $q = 2$ through
$\Delta w_m$ (cf. Eqs. (14) and (18)). Hence the statistical error of the redundancy is related
to the uncertainty of the corresponding Markov model. Since, furthermore, with increasing
conditioning the minimization of the ignored memory $Q_m$ is in accordance with the
minimization of the modelling error of the Markov model, the plausibility argument is complete.

As an example the autoregressive (AR) process

$$x_{n+1} = \sum_{i=0}^{p-1} a_i x_{n-i} + \xi_{n+1} \qquad (21)$$

of order $p = 3$ with parameters $a_0 = 0.2$, $a_1 = 0.3$, $a_2 = 0.4$ is treated, i.e., a
memory depth of three time steps is used. As usual, $\xi_{n+1}$ is Gaussian white noise
with unit variance and zero mean. A dataset of length
$N = 50000$ is used. The result is shown in Fig. 1.

For intermediate resolutions the memory depth $m = p = 3$ is exactly found with the algorithm,
visible in the upper panel of Fig. 1. For higher resolutions, i.e., smaller $\epsilon$, the
statistical error dominates the criterion and a truncation at shorter Markov order is
enforced. This means that although the data stem from a process of Markov order $p = 3$, for
the given dataset size and chosen $\epsilon$ an $m = 2$ model is superior when estimated from
the data. For large $\epsilon$ a larger Markov order is suggested. The order of the Markov
property given by Eq. (21) is also not preserved under coarse graining, which can be observed
from the splitting of entropies with higher conditioning in the lower panel of Fig. 1: for
coarser resolutions the mapping onto discrete states becomes noticeable and causes extra
dependences among the involved random variables, shifting information about the future into
the further past. Since for coarse resolution the statistical error is extremely small, the
splitting of entropies is detected by the criterion, as shown in the upper panel of Fig. 1.

Whereas the application of the criterion given by Eq. (19) was successful in the previous
example, problems arise in the case of more general dynamics. E.g., the discretized
Mackey-Glass dynamics to be discussed in Sec. VII B leads to a memory structure with
omissions, i.e., certain intermediate time steps in the past do not contribute to the
uncertainty reduction of the future. Under the conditions of this section, minimization of the
modelling error is in general accompanied by large statistical errors, such that true joint
minima of both types of errors are not reached. Hence a more subtle procedure for obtaining
optimal Markov approximations is necessary, in which the minimization of the modelling error
by contributions from the further past is not statistically suppressed. A notation for joint
entropies on time series segments with omissions has to be introduced. We call such situations
'perforated'; they are worked out in the next section. As we will see, on the other hand, the
perforated framework introduces a new problem: the relevant entropic quantities are no longer
monotonic in a single parameter describing all possible conditionings, as they were in the
Markov order $m$ in the former case, so that this monotonicity is no longer available for a
criterion for an optimal Markov approximation. A solution with a qualitatively slightly
different generalized criterion, which nevertheless follows essentially the same idea as in
this section, will be offered in Sec. VI.

FIG. 1: (Color online) Upper panel: Suggestion for resolution-dependent usual Markov
approximations for the autoregressive process. Furthermore, the conditional entropies
$H_{1|m}(\epsilon)$ with varying conditioning as a function of the resolution and the
corresponding resolution-dependent statistical error in the estimation of the conditional
entropies are shown. Lower panel: Zoom of the upper panel with the intention to make visible
the splitting of entropies for rather coarse resolutions, which is detected by the algorithm.



V. PERFORATION 

Whereas in Eq. (1) the uncertainty of $m$ random variables corresponding to successive time
steps in a time series is assessed, in this section a notational framework for the evaluation
of uncertainties of random variables corresponding to arbitrary sets of time steps is
introduced. The number $m$ of successive time steps no longer suffices to characterize the set
of time steps whose uncertainty is assessed; instead, the relevant set has to be given
explicitly. We will denote such sets of integers by $J$ (or $K$), and they will be called
perforated if omissions of time steps are involved. E.g., Eq. (8) for the estimation of
order-2 Renyi entropies has to be generalized for the perforated case by

$$H_J^{(2)}(N, \epsilon) = -\ln C_J^{(2)}(N, \epsilon) , \qquad (22)$$

where the vectors $\mathbf{x}_i$, $\mathbf{x}_k$ in Eq. (7) for the correlation sum adopt the
perforation structure given by $J$, i.e., if $J = \{j_1, j_2, \ldots, j_{|J|}\}$, then the
corresponding generalized delay vector with index $i$ reads
$\mathbf{x}_i = (x_{i+j_1}, x_{i+j_2}, \ldots, x_{i+j_{|J|}})$. This in general leads to
non-standard embeddings.
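The generalized delay vectors for an index set $J$ can be constructed as in the following Python sketch. The function name `perforated_vectors` is hypothetical; the only convention assumed is that a vector is formed for every index $i$ at which all shifted indices $i + j$ fall inside the series.

```python
import numpy as np

def perforated_vectors(x, offsets):
    """Rows (x[i+j_1], ..., x[i+j_|J|]) for all i with every index i+j valid.

    offsets: iterable of integer time offsets J (e.g. {-2, 0}); sorted internally.
    """
    x = np.asarray(x, dtype=float)
    J = sorted(offsets)
    start = max(0, -J[0])             # first i with i + min(J) >= 0
    stop = len(x) - max(0, J[-1])     # one past the last i with i + max(J) < len(x)
    if stop <= start:
        raise ValueError("time series too short for this index set")
    return np.column_stack([x[start + j : stop + j] for j in J])
```

With `offsets={-2, 0, 1}` each row joins the conditioning values at distances 2 and 0 from the present with the value one step ahead, which is the vector entering $H_{\{1\} \cup J}$ for $J = \{-2, 0\}$.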

Conditional entropies can be defined in general as

$$H_{K|J}(\epsilon) := H_{K \cup J}(\epsilon) - H_J(\epsilon) , \qquad (23)$$

where $K$ is a set of integers which is disjoint from $J$. This quantity in principle allows
for the evaluation of entropies with noncausal conditioning. In prediction situations the
convention is made that the present is indicated by the index zero. Hence a set $J$ of
conditioning indices in the past consists only of non-positive integers
($J \subset \mathbb{Z}_0^-$), which indicate the respective distances to the present. With
respect to optimal Markov approximations we are interested in single-element sets $K$. In this
case the single element, denoted by $f$, corresponds to a certain future time step, and
Eq. (23) reduces to

$$H_{\{f\}|J}(\epsilon) := H_{\{f\} \cup J}(\epsilon) - H_J(\epsilon) . \qquad (24)$$

With conditioning on the full past for a single time step $f$ in the future, the conditional
entropy becomes $H_{\{f\}|\mathbb{Z}_0^-}(\epsilon)$. As a special case of one step ahead, the
nonperforated conditional entropy with infinite conditioning of Eq. (3) is obtained:

$$H_{\{1\}|\mathbb{Z}_0^-}(\epsilon) = H_{1|\infty}(\epsilon) . \qquad (25)$$

Under perforated circumstances the ignored memory of Eq. (5) is redefined by

$$Q_{\{f\},J}(\epsilon) := H_{\{f\}|J}(\epsilon) - H_{\{f\}|\mathbb{Z}_0^-}(\epsilon) , \qquad (26)$$

and the redundancy of Eq. (4) is now obtained from

$$R_{\{f\},J}(\epsilon) := H_1(\epsilon) - H_{\{f\}|J}(\epsilon) . \qquad (27)$$

As a generalization of Eq. (18), the statistical error of the redundancy in Eq. (27), which
will be essential for the novel criterion for optimal perforated Markov approximations and is
still obtained from usual error propagation, reads

$$\Delta R_{\{f\},J}(N, \epsilon) = \sqrt{[\Delta H_1(N, \epsilon)]^2 + [\Delta H_{\{f\} \cup J}(N, \epsilon)]^2 + [\Delta H_J(N, \epsilon)]^2} . \qquad (28)$$

Having fixed the notational framework for a perforated treatment, a suitable generalization of
the criterion for optimal usual Markov approximations of Sec. IV can be given such that
simultaneous minimization of both the modelling error and the statistical error makes sense
also for generalized dynamics containing inhomogeneously distributed memory in the past.
Moreover, the new notation in principle allows for a treatment of variable future time steps,
jointly conditioned joint entropies, arbitrary omissions in conditionings, noncausal
conditionings and downsampling in a unified framework.



VI. A NOVEL CRITERION FOR OPTIMAL 
GENERALIZED MARKOV APPROXIMATIONS 

Also in the perforated case we consider the two types of errors already discussed in the
context of usual Markov approximations, i.e., the modelling error and the statistical error,
which have to be minimized jointly. As in Sec. IV, the two errors are again quantified by the
ignored memory, i.e., ignored potentially usable information, and the statistical error of the
redundancy, but in this case using the variants of those quantities that respect
perforatedness as introduced in Sec. V. The minimization of the single errors is again
complementary in the number of conditioning indices, but more subtle here, because in
particular the ignored memory is not only a function of the cardinality of the conditioning
set, but depends explicitly on its individual elements.

For finding the resolution-dependent optimal perforated Markov approximation, i.e., the
optimal conditioning sets $J^*(\epsilon)$, we demand, as the central criterion and most
important formula of this paper,

$$Q_{\{f\},J}(\epsilon) + b \cdot \Delta R_{\{f\},J}(N, \epsilon) \overset{!}{=} \min , \qquad (29)$$

where for given $\epsilon$ the minimum is taken in principle over all possible, practically
over all numerically accessible conditionings $J \subset \mathbb{Z}_0^-$, instead of over all
Markov orders $m$ as in Sec. IV. The ignored memory $Q_{\{f\},J}(\epsilon)$ in the perforated
case was defined in Eq. (26), and the statistical error of the redundancy
$\Delta R_{\{f\},J}(N, \epsilon)$ is obtained from Eq. (28). The parameter $b$ accounts for the
weight of the statistical error of the redundancy in the criterion. However, all results of
Sec. VII will be based on the choice $b = 1$. A short discussion of balance factors $b \neq 1$
can be found in Sec. 7 of [?]. If the solution for a certain $\epsilon$ is not unique, in a
second step the set $J(\epsilon)$ with

$$\min(J(\epsilon)) \overset{!}{=} \max \qquad (30)$$

among the preselected ones is taken as $J^*(\epsilon)$.

The result is a resolution-dependent suggestion for optimal perforated Markov approximations.
The chosen criterion is justified by its ability to recover known models behind sufficiently
large data sets in a suitable intermediate interval of resolutions, as shown in Sec. VII.

For the criterion of Eq. (29),

$$Q_{\{f\},J}(\epsilon) + b \cdot \Delta R_{\{f\},J}(N, \epsilon) = H_{\{f\}|J}(\epsilon) - H_{\{f\}|\mathbb{Z}_0^-}(\epsilon) + b \cdot \sqrt{[\Delta H_1(N, \epsilon)]^2 + [\Delta H_{\{f\}|J}(N, \epsilon)]^2} \overset{!}{=} \min , \qquad (31)$$

a simplified approximate representation can be given by

$$H_{\{f\}|J}(\epsilon) + b \cdot \Delta H_{\{f\}|J}(N, \epsilon) \overset{!}{=} \min , \qquad (32)$$

because $H_{\{f\}|\mathbb{Z}_0^-}(\epsilon)$ and $\Delta H_1(N, \epsilon)$ are independent of
$J$ and hence act as constants for given resolution $\epsilon$. Eq. (32) is a very good
approximation of Eq. (31), since $\Delta H_1(N, \epsilon)$ is in general small compared to
$\Delta H_{\{f\}|J}(N, \epsilon)$. The interpretation of this approximation of the criterion is
that the value of the conditional entropy including its statistical error has to be minimal.
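The selection step of Eqs. (29) and (30) can be sketched as follows in Python, assuming the estimates of $Q_{\{f\},J}$ and $\Delta R_{\{f\},J}$ have already been computed for every candidate set at a fixed resolution. The function names, the dictionary interface and the bounds `max_lag` and `max_size` that limit the search are illustrative assumptions; exact floating-point equality is used to detect ties, which a practical implementation might relax.

```python
from itertools import combinations

def candidate_sets(max_lag, max_size):
    """All non-empty subsets of {-max_lag, ..., 0} with at most max_size elements."""
    indices = range(-max_lag, 1)
    for size in range(1, max_size + 1):
        for combo in combinations(indices, size):
            yield frozenset(combo)

def optimal_conditioning(ignored_memory, redundancy_error, b=1.0):
    """Return the set J minimising Q + b * Delta R (Eq. (29)).

    ignored_memory, redundancy_error: dicts mapping frozenset J -> estimate.
    Ties are resolved by Eq. (30): prefer the set with the largest minimum,
    i.e. the one reaching least far into the past.
    """
    cost = {J: ignored_memory[J] + b * redundancy_error[J] for J in ignored_memory}
    best = min(cost.values())
    tied = [J for J, c in cost.items() if c == best]
    return max(tied, key=min)
```

The same function covers the simplified criterion of Eq. (32) if the dictionaries hold $H_{\{f\}|J}$ and $\Delta H_{\{f\}|J}$ instead.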



VII. EXAMPLES 

In order to evaluate the ability of the introduced cri- 
terion for determination of optimal perforated Markov 
approximations, it is tested on data sets, for which the 
structure of dependence is known. The test is carried 
out with linear stochastic and with nonlinear determin- 
istic dynamics. In the context of the example of the au- 
toregressive process furthermore the dependence of the 
output of the criterion on the length of the underlying 
dataset is explicitly addressed. 

A. Autoregressive processes 

1. Suggestion of the optimal perforated Markov 
approximation and comparison with the memory structure 
underlying the dataset 

The map of AR processes was given in Eq. (20). For
the first analysis a dataset of N = 40000 data points is
generated for a simple autoregressive process with
parameters $a_0 = a_2 = 0.4$, which fixes the structure of
dependence in the iteration procedure. Parameters not
mentioned are understood to be zero. The time step
with index 1 depends on the time steps given by the set
$J_0 = \{-2, 0\}$. A full search for the $\epsilon$-dependent optimal
conditioning structure according to the criterion stated
in Eq. (29) is carried out. Whenever statistical fluctuations
lead to the theoretically impossible estimate
$H_{\{f\}|J}(\epsilon) < H_{\{f\}|\mathbb{Z}^-}(\epsilon)$, the estimated value of
$H_{\{f\}|J}(\epsilon)$ is replaced by the value of $H_{\{f\}|\mathbb{Z}^-}(\epsilon)$,
which by Eq. (26) avoids negative $Q_{\{f\}|J}(\epsilon)$ in Eq. (29).
The result is shown in Fig. 2.

FIG. 2: Resolution-dependent optimal perforated Markov
model for a dataset of N = 40000 data points of an AR(3)
process with $a_0 = a_2 = 0.4$. The optimal conditioning
structures can be found in vertical direction.
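
A test dataset of this kind can be generated with a few lines of code. The following is a minimal sketch, assuming Gaussian white noise of unit variance and a discarded transient; the function name `simulate_ar`, the noise variance, the burn-in length and the seed are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def simulate_ar(coeffs, n_points, noise_std=1.0, burn_in=1000, seed=0):
    """Iterate x_{n+1} = sum_i a_i x_{n-i} + noise, with coeffs = {lag i: a_i}."""
    rng = np.random.default_rng(seed)
    max_lag = max(coeffs)                      # largest lag with nonzero coefficient
    total = n_points + burn_in + max_lag + 1
    x = np.zeros(total)
    noise = rng.normal(0.0, noise_std, size=total)
    for n in range(max_lag, total - 1):
        x[n + 1] = sum(a * x[n - lag] for lag, a in coeffs.items()) + noise[n]
    return x[-n_points:]                       # drop the transient

# a_0 = a_2 = 0.4 corresponds to the conditioning set J_0 = {-2, 0}
series = simulate_ar({0: 0.4, 2: 0.4}, n_points=40000)
```

The dictionary of lags is the counterpart of the set $J_0$: a nonzero coefficient at lag i corresponds to the element -i of the conditioning set.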

A first result is that the optimal conditioning structure
$J^*(\epsilon)$ found is indeed resolution-dependent.
Interpreting Fig. 2, it is possible to extract three regimes:

For high resolution, i.e. small $\epsilon$, the statistical errors of
the entropy estimates are rather large, because in particular
for longer conditioning fewer neighboring delay vectors can be
found for the estimation of the correlation sum. Hence the
criterion is dominated by the statistical error of the
redundancy, which causes perforated structures with fewer
elements to be detected as optimal. Interestingly, single
conditioning on the further past is selected as superior to
single conditioning on the present. The statistical errors of
$H_{\{1\}|\{0\}}$ and of $H_{\{1\}|\{-2\}}$ are about the same, but the
conditional entropy is estimated to be slightly smaller in the
latter case.

For intermediate resolutions, the most interesting part
of the plot, the model behind the dataset is found, i.e.,

$$ J^*(\epsilon) = J_0 , \tag{33} $$

because the statistical error is sufficiently small and the
resolution is sufficiently high that the information term
dominates the criterion without being disturbed by either
statistical or resolution effects. Nevertheless, in this
domain of resolutions the statistical error of the redundancy
still serves to exclude, among those conditionings that are
equally optimal from the informational point of view, all that
are longer than necessary.

For even coarser resolution, there is the domain of
coarse graining splittings. The statistical error typically
does not play a role anymore and the information term
becomes influenced by resolution effects. Even though they
do not carry information from the dynamical law, additional
components are frequently chosen to appear in the optimal
perforated Markov model when the dataset is analyzed with
coarse resolution. This is the same effect as already shown
for the usual Markov approximation of the autoregressive
process in Fig. 1.

The criterion is tested by asking whether the components
of conditioning $J_0$ in the dynamics behind the generated
dataset can be retrieved. In the rather simple case of
autoregressive processes the criterion can hence be applied
successfully.

An alternative and widely used approach to Gaussian 
time series data is to directly fit the parameters of a linear 
model Eq.(20) to them. Such routines minimize the vari- 
ance of the residuals without making use of information 
theoretic concepts. For pre-selected model order p, using 
the whole dataset each single parameter is estimated 
explicitly, instead of only selecting components. Hence, 
such fitting methods appear to be superior to what we 
propose here, and indeed for data from AR-processes, 
they are superior. However, firstly, our goal is to identify 
the relevant components of a delay vector, which includes 
the determination of the model order p, without pre- 
selection. Secondly, our approach is neither restricted 
to data from linear models nor to Gaussian data, but de- 
velops its full strength for nonlinear (stochastic) systems, 
as we will demonstrate later. 



2. Dataset length dependence of the optimal perforated
Markov model



Since an essential part of the criterion (29) consists of
a statistical error, it is immediately clear that the result
always depends crucially on the length of the underlying
dataset. The consequences of this dependence are outlined in
the following. As an example, the special AR(7) process

$$ x_{n+1} = 0.3\,x_n + 0.3\,x_{n-4} + 0.3\,x_{n-6} + \xi_n \tag{34} $$

is used, where $\xi_n$ denotes the noise term and the structure
of dependence in the iteration procedure is described by the
set $J_0 = \{-6,-4,0\}$. With this iteration procedure data sets
of different lengths (N = 3000, 8000, 20000, 50000) are
generated and then analyzed with respect to the optimal
resolution-dependent perforated Markov approximations. The
results are shown in Fig. 3. For longer data sets, the time
steps of memory in the underlying dynamics can be retrieved
on a broader interval of resolutions and with higher
reliability. For shorter data sets the influence of the
statistical error in the criterion (29) increases and the
domain of dominance of the information term is shifted to
coarser resolutions. This is seen in the selection of fewer
components for the optimal model at those resolutions where
the statistical error term is dominant. The structure of
conditioning $J_0$ of the underlying dynamics becomes blurred
if, for sufficiently short data sets, the domain of dominance
of the statistical error starts to touch the domain of coarse
graining effects. In the bottom right panel of Fig. 3 this
case is almost reached.
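
The full search over conditionings mentioned in the caption of Fig. 3 can be sketched as an enumeration of index sets. The sketch assumes that "10 past time steps" refers to the offsets 0, -1, ..., -9 and that the empty conditioning is excluded; both choices, as well as the function name, are assumptions made for illustration.

```python
from itertools import combinations

def candidate_conditionings(max_past=10):
    """Yield every non-empty set of offsets drawn from 0, -1, ..., -(max_past - 1)."""
    offsets = range(-(max_past - 1), 1)
    for size in range(1, max_past + 1):
        yield from combinations(offsets, size)

candidates = list(candidate_conditionings(10))
n_candidates = len(candidates)   # 2**10 - 1 non-empty subsets
```

Under these assumptions the search covers 1023 candidate sets per resolution, among them the structure (-6, -4, 0) of Eq. (34). The exponential growth with the number of admitted past steps is what makes restricted variants of the search attractive for longer memory.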

FIG. 3: Resolution-dependent optimal perforated Markov
models of an AR(7) process with coefficients $a_0 = a_4 = a_6 = 0.3$
for different lengths N of the data sets. The results are obtained
from a full loop over all possible conditionings restricted only
by the maximal number of 10 past time steps.

In the hypothetical case of infinite dataset length all
statistical errors become zero for all resolutions and the
criterion (29) is governed by the ignored memory. Optimality
is then selected for minimal modelling error, quantified by
vanishing ignored memory. If memory ranges infinitely far
into the past, then a Markov approximation is always
accompanied by a loss of information. According to the
criterion a Markov approximation of finite order can thus not
be selected as optimal. If the range of memory into the past
is finite, a Markov approximation is possible beyond which no
information is found in the further past. Such a truncation
would not be necessary, however, because components of the
past that carry no information about the future but are kept
nevertheless do not diminish the quality of the model with
respect to the first part of the criterion in Eq. (29) in the
case of infinite data sets. The second part of the criterion,
given in Eq. (30), decides for the shortest conditioning in the
set of degenerate selected perforated Markov models.



B. Mackey-Glass dynamics 

As a second example for testing the performance of the
criterion (29) we analyze the Mackey-Glass dynamics [12]
given by

$$ \dot{x}(t) = \frac{a\,x(t-\tau)}{1 + [x(t-\tau)]^c} - b\,x(t) , \tag{35} $$

a time-continuous nonlinear deterministic example with
memory. The state at time t depends explicitly on the
state at time $t - \tau$. Mackey-Glass dynamics is a
representative of the class of delay differential equations,
a subset of the set of infinite-dimensional dynamical systems.
It serves as a model for the regeneration of white blood cells
in patients with leukemia. Discretized, the equation of
motion reads



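$$ x_{n+1} = (1 - b\,\Delta t)\,x_n + \frac{a\,x_{n-k}}{1 + x_{n-k}^c}\,\Delta t \tag{36} $$

with the delay

$$ k = \frac{\tau}{\Delta t} \in \mathbb{N} . \tag{37} $$

Typical parameter values [12, 13] are

$$ a = 0.2, \quad b = 0.1, \quad c = 10 . \tag{38} $$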



As an example, taking $\Delta t = 0.01$ time units, a delay of
e.g. k = 1800 time steps leads to a time delay of $\tau = 18$
time units. For $\tau > 16.8$ time units it is known that the
dynamics is essentially chaotic. Using only every 300th time
step of the dataset for the analysis leads to an effective
delay of K = 6 time steps. For the following analysis, data
sets of 12000 effective data points are used. Even though the
delay is invisible in Fig. 4 for usual (nonperforated)
conditional entropies, the entropic-statistical criterion (29)
selects it. This is seen in Fig. 5, where a whole series of
optimal perforated Markov models for different effective
delays is shown. The right part of the panels is again
subject to coarse graining effects. For higher resolution
more structure is visible. The most important point to
stress is that all panels have in common an interval of
resolutions where the optimal perforated Markov model contains
omissions behind the first step of conditioning, and the first
following time step taken into account is exactly the time
step corresponding to the effective delay of the dynamics.
The index 0 is always part of $J^*(\epsilon)$ because of the
$x_n$-term in Eq. (36).
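
The discretized dynamics of Eq. (36) and the downsampling described above can be sketched as follows. The constant initial history `x0`, the function name and the fact that no transient is discarded are assumptions made for illustration; the text does not specify the initial condition.

```python
import numpy as np

def mackey_glass_euler(n_steps, a=0.2, b=0.1, c=10.0, dt=0.01, k=1800, x0=0.5):
    """Iterate Eq. (36): x_{n+1} = (1 - b dt) x_n + a x_{n-k} / (1 + x_{n-k}^c) dt."""
    x = np.empty(n_steps)
    x[: k + 1] = x0                       # constant history (assumption)
    for n in range(k, n_steps - 1):
        delayed = x[n - k]
        x[n + 1] = (1.0 - b * dt) * x[n] + a * delayed / (1.0 + delayed ** c) * dt
    return x

def downsample(x, step=300):
    """Keep every step-th point; a delay of k steps becomes k / step effective steps."""
    return x[::step]

# Short demonstration run; k = 1800 and step = 300 give the effective delay K = 6.
fine = mackey_glass_euler(n_steps=6000)
coarse = downsample(fine, step=300)
effective_delay = 1800 // 300
```

A dataset of 12000 effective points at these settings corresponds to 3.6 million Euler steps, so the plain Python loop shown here is only meant to make the iteration explicit.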

Concluding, the very long range of the memory of
the Mackey-Glass system requires strong downsampling,
from which the complication arises that the resulting
effective memory is subject to some smearing effects.
Nevertheless, since the memory was detected by the criterion,
this example also has to be interpreted as a successful test
of the criterion.

FIG. 4: (Color online) Conditional entropies of the Mackey-
Glass dynamics with effective delay K = 6.

Without going into detail here it should be mentioned
that in [11] various further variants for the selection of
conditioning components, e.g. a restricted cardinality
of conditioning components or a priori omissions of indices,
were suggested in order to reach the further past
for detection of potential memory.



VIII. CONSEQUENCES FOR PREDICTION 
A. Point prediction and prediction error 

General point prediction one time step into the future
reads

$$ \hat{x}_{n+1} = F(\mathbf{x}_n) \tag{39} $$



with a suitably chosen function F. The average quality
of predictions can be evaluated by an accuracy measure.
We choose the root mean squared (rms) prediction error
given by

$$ e = \sqrt{\left\langle \left( x_{n+1} - \hat{x}_{n+1} \right)^2 \right\rangle} . \tag{40} $$



As a consequence, the mean value of the estimated
distribution of $X_{n+1}$, the random variable corresponding to
the measured value $x_{n+1}$, is the optimal F. This
distribution is estimated from a selected set of $x_{k+1}$, which
are obtained from those $\mathbf{x}_k$ that are in some sense
suitably related to $\mathbf{x}_n$. A decision as to what a
'suitable relation' should be is not immediately given by
Eq. (40) and has to be made additionally. Another possible
accuracy measure could be the mean absolute error, which would
lead to an optimal F given by the median. In general the
prediction error depends on the lead time (time into the
future), the dataset length N, the noise in the dynamical
modeling F and possibly on the resolution $\epsilon$.

FIG. 5: Resolution-dependent optimal perforated Markov
models for the Mackey-Glass dynamics with different
effective delays K.
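
The statement that the mean is the optimal point prediction for the rms error, while the median is optimal for the mean absolute error, can be checked numerically. The sample of future values below is hypothetical and the minimization is done by a simple grid search.

```python
import numpy as np

# Hypothetical futures x_{k+1} of the selected neighbors
futures = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 1001)          # candidate point predictions

# Mean squared and mean absolute deviation for every candidate prediction
mse = ((futures[None, :] - grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(futures[None, :] - grid[:, None]).mean(axis=1)

best_for_rms = grid[np.argmin(mse)]          # close to the sample mean
best_for_mae = grid[np.argmin(mae)]          # close to the sample median
```

Because of the single outlier in the sample, the two optimal predictions differ noticeably, which illustrates that the choice of accuracy measure fixes the choice of F.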



B. Locally constant prediction with generalized 
delay vectors 

A special point prediction used in the following, which 
is locally constant (cmp. the zeroth order predictor in 
[lj|) and perforated, reads 



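$$ \hat{x}_{n+1}(\epsilon) = \frac{\sum_{k \neq n} \Theta\!\left(\epsilon - \| P_J \mathbf{x}_n - P_J \mathbf{x}_k \|\right) x_{k+1}}{\sum_{k \neq n} \Theta\!\left(\epsilon - \| P_J \mathbf{x}_n - P_J \mathbf{x}_k \|\right)} = \frac{1}{\left| \{ P_J \mathbf{x}_{k \neq n} \in \mathcal{U}(\epsilon, P_J \mathbf{x}_n) \} \right|} \sum_{P_J \mathbf{x}_{k \neq n} \in \mathcal{U}(\epsilon, P_J \mathbf{x}_n)} x_{k+1} . \tag{41} $$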



$P_J$ is the projection operator onto the perforation
structure given by the set J already encountered in Sec. V, and
$\mathcal{U}(\epsilon, P_J \mathbf{x}_n)$ is the $\epsilon$-neighborhood of the vector
$P_J \mathbf{x}_n$ introduced in Eq. (10). Apart from the perforation, the
method is also known as the Lorenz method of analogues:
The predicted future value is the mean of the known futures
of similar states from the past. We will study explicitly the
resolution dependence of the corresponding prediction error:



$$ e(\epsilon) = \sqrt{\left\langle \left( x_{n+1} - \hat{x}_{n+1}(\epsilon) \right)^2 \right\rangle} . \tag{42} $$
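
The locally constant perforated predictor and its resolution-dependent rms error can be sketched as follows. The sketch assumes the maximum norm for the distance between projected delay vectors and represents the set J as a tuple of non-positive time offsets; the function names and the in-sample evaluation over the whole series are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def perforated_neighbors(x, n, J, eps):
    """Indices k != n whose projected delay vector lies within eps of that at n."""
    J = np.asarray(sorted(J))
    reach = -J.min()                          # how far back the conditioning reaches
    ks = np.arange(reach, len(x) - 1)         # k needs a full past and a known future
    ks = ks[ks != n]
    vecs = x[ks[:, None] + J[None, :]]        # projected delay vectors P_J x_k
    dist = np.max(np.abs(vecs - x[n + J]), axis=1)   # maximum norm (assumption)
    return ks[dist < eps]

def predict_locally_constant(x, n, J, eps):
    """Mean of the futures x_{k+1} of all neighbors; NaN if there is none."""
    ks = perforated_neighbors(x, n, J, eps)
    return x[ks + 1].mean() if ks.size else np.nan

def rms_prediction_error(x, J, eps):
    """Root mean squared one-step error over all points that have neighbors."""
    reach = -min(J)
    errors = []
    for n in range(reach, len(x) - 1):
        prediction = predict_locally_constant(x, n, J, eps)
        if not np.isnan(prediction):
            errors.append((x[n + 1] - prediction) ** 2)
    return np.sqrt(np.mean(errors)) if errors else np.nan

# Toy usage on a period-3 sequence, for which the prediction is exact
x = np.tile([0.0, 1.0, 3.0], 20)
prediction = predict_locally_constant(x, 30, (0,), 0.5)
error = rms_prediction_error(x, (-1, 0), 0.5)
```

Points without any neighbor at the given resolution are skipped in the error, which is one possible convention; for small $\epsilon$ this reduces the number of predictions that enter the average.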



C. Example: Prediction from optimal perforated 
Markov model for generalized Henon dynamics 

After having found the optimal resolution-dependent
perforated Markov models from the criterion (29) with
a suitable balance b, the corresponding component
structures given by $J^*(\epsilon)$ can be used for the calculation
of point predictions according to Eq. (41) and rms
prediction errors according to Eq. (42). In the following,
the prediction error corresponding to the optimal perforated
Markov model, i.e., conditioning in the sense of
the optimal generalized delay vector (GDV), is compared
to the minimum of the prediction errors of usual standard
embeddings (1-5 delay vector (DV) components in present
and past; delay of 1-5 time steps). As an example
we treat the generalized Henon map



$$ y_{n+1} = a - y_{n-K+2}^2 - c\,y_{n-K+1} , \tag{43} $$



a simple chaotic system introduced by Baier and Klein in
[16]. In general it contains longer memory than the usual
Henon map

$$ x_{n+1} = 1 - \alpha\,x_n^2 + \beta\,x_{n-1} , \tag{44} $$



which is obtained from the generalized Henon map in the
case of K = 2 from the transformation $y = \alpha x$, $a = \alpha$
and $c = -\beta$. The nonlinearity still arises from one single
quadratic term. The coefficients are chosen to be a = 1.76
and c = 0.1. Comparing the coefficients 1 and c of the
non-constant terms of Eq. (43) shows that in this case the
linear term is suppressed in importance. For the choice of
the delay K = 4 the structure of dependence is indicated by
the set $J_0 = \{-3,-2\}$.
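
The relation between the generalized and the usual Henon map can be verified numerically with a short sketch. Zero initial history is an assumption made for illustration, as are the function names; only a few iterations are compared.

```python
import numpy as np

def generalized_henon(n_points, K=4, a=1.76, c=0.1, history=None):
    """Iterate Eq. (43): y_{n+1} = a - y_{n-K+2}^2 - c y_{n-K+1}."""
    y = np.zeros(n_points)
    if history is not None:
        y[:K] = history                        # first K values (default: zeros)
    for n in range(K - 1, n_points - 1):
        y[n + 1] = a - y[n - K + 2] ** 2 - c * y[n - K + 1]
    return y

def henon(n_points, alpha=1.76, beta=-0.1, history=(0.0, 0.0)):
    """Iterate Eq. (44): x_{n+1} = 1 - alpha x_n^2 + beta x_{n-1}."""
    x = np.zeros(n_points)
    x[:2] = history
    for n in range(1, n_points - 1):
        x[n + 1] = 1.0 - alpha * x[n] ** 2 + beta * x[n - 1]
    return x

# For K = 2 the two maps agree under y = alpha * x, a = alpha, c = -beta.
y2 = generalized_henon(12, K=2)
x2 = henon(12)
agree = np.allclose(y2, 1.76 * x2)

# For K = 4 the new value depends only on the offsets -2 and -3.
y4 = generalized_henon(8, K=4)
```

In the K = 4 run the value at index 7 is built from the values at indices 4 and 3, which corresponds to the dependence structure $J_0 = \{-3,-2\}$ stated in the text.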

In Fig. 6, results for prediction (lower panel) from
optimal perforated Markov models (upper panel) are shown
for the generalized Henon map for a balance factor of
b = 4 in (29), which favors models with fewer components.
It is seen that the prediction error from the optimal
perforated Markov model is smaller than the minimum of
prediction errors from standard embeddings. This serves
as a justification for the introduction of perforation
into the framework of Markov approximations and for the
practical applicability of the criterion (29) for prediction
purposes.
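
The standard embeddings used for comparison can be written as index sets in the same notation as the perforated conditionings. The following sketch assumes that a standard delay vector with m components and delay d corresponds to the offsets 0, -d, ..., -(m-1)d; the function name is hypothetical.

```python
def standard_embeddings(max_dim=5, max_delay=5):
    """Map (dimension m, delay d) to the offsets (-(m-1)d, ..., -d, 0)."""
    sets = {}
    for m in range(1, max_dim + 1):
        for d in range(1, max_delay + 1):
            sets[(m, d)] = tuple(-d * i for i in range(m - 1, -1, -1))
    return sets

embeddings = standard_embeddings()
unique_sets = set(embeddings.values())
contains_true_structure = (-3, -2) in unique_sets
```

Under this assumption the 25 combinations of dimension and delay give 21 distinct index sets, since all one-component vectors coincide. Each of them contains the present, so the structure $J_0 = \{-3,-2\}$ of the generalized Henon map with K = 4 is not among them, which makes an advantage of perforated conditioning plausible for this example.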

FIG. 6: (Color online) Upper panel: Optimal resolution-
dependent perforated Markov model ($J^*(\epsilon)$) for a dataset of
N = 10000 data points of the generalized Henon map with delay
K = 4. Lower panel: Resolution-dependent prediction error
$e_{\mathrm{optGDV}}(\epsilon)$ from $J^*(\epsilon)$, minimal prediction error
$e_{\mathrm{optstdDV}}(\epsilon)$ of standard delay vectors and relative rank of
the prediction error from $J^*(\epsilon)$ in the list of prediction errors
from standard embeddings.



IX. CONCLUSION 

For dynamics with potentially infinite memory, e.g. 
from projection of stochastic dynamics into one measure- 
ment quantity, novel criteria for optimal Markov approx- 
imations were introduced. It was realized that essentially 
two types of errors are relevant: First, a modelling error, 
quantified by the ignored memory, and second, a sta- 
tistical error of uncertainty reduction, quantified by the 
statistical error of the redundancy. 

Usually Markov approximations are accompanied by
losses of information, which become smaller the more memory
is taken into account. Exactly the opposite holds
for the statistical error of the uncertainty reduction,
because a larger Markov order causes stronger restrictions
in neighbor search algorithms, which are responsible for
larger statistical errors in the estimation of entropies and
hence also of the redundancy. The rather simple idea behind
the criterion for usual Markov approximations is that it
makes no sense to further reduce ignored memory if the
statistical error of the uncertainty reduction is already
larger. Here the monotonicity properties of the involved
quantities were used in the mathematical formulation of
the criterion.

Even though this criterion could be applied successfully
to simple dynamics, problems arise from the huge
statistical errors for high cardinality of conditioning sets
for dynamics with long-range and inhomogeneously
distributed memory. Hence, a generalization to a perforated
case, in which omissions of time steps in the past are
allowed, was needed. A generalized notational framework
of information theory in time series analysis was developed,
which in principle allows for a unified description
of variable future time steps ahead, jointly conditioned
joint uncertainties, regular perforation (downsampling)
and arbitrary irregular perforation with the tools of
information theory. On this basis a novel criterion for
optimal perforated Markov approximations was introduced,
in which the selection algorithm for relevant conditioning
components takes into account that the modelling error is
not monotonic in the cardinality of the conditioning set.

The perforated criterion was successfully tested for
linear stochastic (AR) and nonlinear deterministic (Mackey-
Glass) dynamics. It was found that the optimal perforated
Markov model is resolution-dependent. For certain
intervals of intermediate resolution the memory structure
of the dynamical law was retrieved by the suggested
criterion, which indicates that it is capable of yielding
suitable Markov approximations. For coarse resolutions,
coarse graining effects are clearly seen, and for fine
resolutions fewer conditioning components are selected for
statistical reasons. The importance of the dependence on the
length of the underlying dataset was pointed out.

Since the methods are based exclusively on quantities
from information theory and statistical errors in their
estimation, in particular the perforated variant is
applicable to a broad class of dynamics. This is especially
useful for an analysis of data sets for which convenient
properties such as linearity cannot be assumed. The explicit
calculation of the statistical error of entropies made
criteria accessible that are based only on entropies and
quantities derived from them. In spite of the success of the
criterion on the example dynamics, it has to be mentioned that
nonstationarity and intermittency still remain as problems.



For locally constant and perforated point prediction an 
explicitly resolution-dependent root mean squared pre- 
diction error was introduced. For certain resolutions 
an improvement of the rms prediction error from the 
resolution-dependent optimal perforated Markov model 
in comparison with the rms prediction error from stan- 
dard embeddings was seen in the example of the gener- 
alized Henon map. 